| | Cell type A | Cell type B | Cell type C | Cell type D |
|---|---|---|---|---|
| Gene 0 | 100 | 200 | 50 | 400 |
| Gene 1 | 50 | 0 | 0 | 100 |
| Gene 2 | 350 | 100 | 50 | 200 |
In [ ]:
gene0 = [100, 200, 50, 400]
gene1 = [50, 0, 0, 100]
gene2 = [350, 100, 50, 200]
expression_data = [gene0, gene1, gene2]
Why is this a bad idea?
In [ ]:
import numpy as np
a = np.array(expression_data)
print(a)
We are going to:

- obtain an expression matrix in RPKM units
- quantile normalize the data

using the awesome power of NumPy.
In [ ]:
def print_info(a):
    print('number of elements:', a.size)
    print('number of dimensions:', a.ndim)
    print('shape:', a.shape)
    print('data type:', a.dtype)
    print('strides:', a.strides)
    print('flags:')
    print(a.flags)
print_info(a)
In [ ]:
print(a.data)
In [ ]:
abytes = a.ravel().view(dtype=np.uint8)
In [ ]:
print_info(abytes)
In [ ]:
print(abytes[:24])
In [ ]:
print_info(a)
In [ ]:
print_info(a.T)
In [ ]:
print_info(a.T[::2])
In [ ]:
print_info(a.T[::2, ::2])
In [ ]:
b = a
In [ ]:
print(b)
In [ ]:
a[0, 0] = 5
print(b)
a[0, 0] = 100
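As the cells above show, `b = a` creates a second name for the same underlying data, not a copy. A quick sketch (with the same values as the table above) of getting an independent copy instead:

```python
import numpy as np

a = np.array([[100, 200, 50, 400],
              [50, 0, 0, 100],
              [350, 100, 50, 200]])
b = a.copy()    # b owns its own data, unlike b = a
a[0, 0] = 5
print(b[0, 0])  # still 100: modifying a leaves the copy untouched
```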
In [ ]:
expr = np.load('expr.npy')
In [ ]:
print_info(expr)
This has the raw read count data. However, each sample gets a different number of reads, so we want to normalize by the *library size*, which is the total number of reads across a column. The `np.sum` function returns the sum of all the elements of an array. With the `axis` argument, you can take the sum along the given axis.
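For example, using the small count matrix from the table at the top:

```python
import numpy as np

counts = np.array([[100, 200, 50, 400],
                   [50, 0, 0, 100],
                   [350, 100, 50, 200]])

print(np.sum(counts))          # sum of all elements
print(np.sum(counts, axis=0))  # column sums: the library sizes
print(np.sum(counts, axis=1))  # row sums: total reads per gene
```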
In [ ]:
lib_size = np.sum(expr, axis=0)
In order to normalize every column by its corresponding library size, we have to align the two arrays' axes: each dimension must be either the same size, or one of the arrays must have size 1. Use `np.newaxis` to match the dimensions.
In [ ]:
print(expr.shape)
print(lib_size.shape)
print(lib_size[np.newaxis, :].shape)
However, NumPy will automatically prepend singleton dimensions until the array shapes match or there is an error:
In [ ]:
np.all(expr / lib_size ==
expr / lib_size[np.newaxis, :])
In [ ]:
expr_lib = expr / lib_size
We also multiply by $10^6$ in order to keep the numbers on a readable scale (reads per million reads).
In [ ]:
expr_lib *= 1e6
Finally, longer genes are more likely to produce reads. So we normalize by the gene length (in kb) to produce a measure of expression called Reads Per Kilobase per Million reads (RPKM).
In [ ]:
gene_len = np.load('gene-lens.npy')
print(gene_len.shape)
In [ ]:
rpkm = expr_lib # FIX THIS
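One possible solution for the cell above, sketched with tiny made-up arrays (the real `expr_lib` and `gene_len` come from the loaded files; the values below are hypothetical, and `gene_len` is assumed to be in base pairs):

```python
import numpy as np

# hypothetical stand-ins for the loaded data, shape (genes, samples)
expr_lib = np.array([[1.0e6, 2.0e6],
                     [5.0e5, 1.0e6]])   # reads per million
gene_len = np.array([1000, 2000])       # gene lengths in base pairs (assumption)

# divide each gene (row) by its length in kilobases;
# np.newaxis aligns the 1D gene_len against the sample axis
rpkm = expr_lib / (gene_len[:, np.newaxis] / 1000)
print(rpkm)
```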
In [ ]:
from matplotlib import pyplot as plt
from scipy import stats
def plot_col_density(data, xlim=None, *args, **kwargs):
    # Use Gaussian smoothing to estimate the density of each column
    density_per_col = [stats.gaussian_kde(col) for col in data.T]
    if xlim is not None:
        m, M = xlim
    else:
        m, M = np.min(data), np.max(data)
    x = np.linspace(m, M, 100)
    plt.figure()
    for density in density_per_col:
        plt.plot(x, density(x), *args, **kwargs)
    plt.xlabel('log-counts')
    plt.ylabel('frequency')
    if xlim is not None:
        plt.xlim(xlim)
    plt.show()
In [ ]:
%matplotlib inline
In [ ]:
plt.style.use('ggplot')
In [ ]:
plot_col_density(np.log(expr+1))
In [ ]:
plot_col_density(np.log(rpkm + 1), xlim=(0, 250))
In [ ]:
x = np.random.rand(3, 5)
y = np.random.randint(10, size=8)
z = x # FIX THIS
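One reading of this exercise is that `z` should combine `x` (shape `(3, 5)`) and `y` (shape `(8,)`) by broadcasting into a `(3, 5, 8)` array; that interpretation is a guess, but the mechanics look like this:

```python
import numpy as np

x = np.random.rand(3, 5)
y = np.random.randint(10, size=8)

# insert a new trailing axis on x so the shapes (3, 5, 1) and (8,)
# broadcast together to (3, 5, 8)
z = x[:, :, np.newaxis] * y
print(z.shape)
```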
In [ ]:
def repeat(arr, n):
    return np.lib.stride_tricks.as_strided(arr,
                                           shape=(n,) + arr.shape,
                                           strides=(0,) + arr.strides)
In [ ]:
repeat(np.random.rand(5), 4)
In [ ]:
def sliding_window(arr, size=2):
    """Produce an array of sliding window views of `arr`.

    Parameters
    ----------
    arr : 1D array, shape (N,)
        The input array.
    size : int, optional
        The size of the sliding window.

    Returns
    -------
    arr_slide : 2D array, shape (N - size + 1, size)
        The sliding windows of size `size` of `arr`.

    Examples
    --------
    >>> a = np.array([0, 1, 2, 3])
    >>> sliding_window(a, 2)
    array([[0, 1],
           [1, 2],
           [2, 3]])
    """
    return arr  # fix this
In [ ]:
# test your code here
sliding_window(np.arange(8), 3)
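One possible `as_strided` solution for `sliding_window` (a sketch; it assumes a 1D input and makes no copy, so the windows share memory with `arr`):

```python
import numpy as np

def sliding_window(arr, size=2):
    # each window starts one element after the previous one, so both
    # strides equal the element stride of the input
    n = len(arr) - size + 1
    step = arr.strides[0]
    return np.lib.stride_tricks.as_strided(arr, shape=(n, size),
                                           strides=(step, step))

print(sliding_window(np.arange(8), 3))
```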
You can index arrays with slicing, but also with boolean arrays (including broadcasting!), integer arrays, and individual indices along multiple dimensions.
In [ ]:
values = np.array([0, 5, 99])
selector = np.random.randint(0, 3, size=(3, 4))
print(selector)
print(values[selector])
[Quantile normalization](https://en.wikipedia.org/wiki/Quantile_normalization) is a method to align distributions. Implement it using NumPy axis-wise operations and fancy indexing.

Hint: look at the documentation for `scipy.stats.rankdata`, `np.sort`, and `np.argsort`.
In [ ]:
def qnorm(x):
    """Quantile normalize an input matrix.

    Parameters
    ----------
    x : 2D array of float, shape (M, N)
        The input data, with each column being a
        distribution to normalize.

    Returns
    -------
    xn : 2D array of float, shape (M, N)
        The normalized data.
    """
    xn = np.copy(x)  # replace this by normalizing code
    return xn
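One possible sketch of the normalizing code, written here as a separate hypothetical helper `qnorm_sketch` so the exercise skeleton stays intact. Note that it ignores tied values (ties get arbitrary ranks rather than averaged ones):

```python
import numpy as np

def qnorm_sketch(x):
    # rank of each value within its column (ties broken arbitrarily)
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # mean across columns of each row of the column-sorted matrix
    means = np.mean(np.sort(x, axis=0), axis=1)
    # fancy indexing assigns each value the mean for its rank
    return means[ranks]

data = np.array([[5., 4., 3.],
                 [2., 1., 4.],
                 [3., 4., 6.],
                 [4., 2., 8.]])
print(qnorm_sketch(data))
```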
In [ ]:
logexpr = np.log(expr + 1)
logrpkm = np.log(rpkm + 1)
In [ ]:
logexprn = qnorm(logexpr)
logrpkmn = qnorm(logrpkm)
In [ ]:
plot_col_density(logexprn)
In [ ]:
plot_col_density(logrpkmn, xlim=(0, 0.25))
(If time permits.)
Date: Wed, 16 Jul 2008 16:45:37 -0500
From: Jack Cook
To: <numpy-discussion@scipy.org>
Subject: Numpy Advanced Indexing Question
Greetings,
I have an I,J,K 3D volume of amplitude values at regularly sampled time intervals. I have an I,J 2D slice which contains a time (K) value at each I, J location. What I would like to do is extract a subvolume at a constant +/- K window around the slice. Is there an easy way to do this using advanced indexing or some other method? Thanks in advanced for your help.
-- Jack
In [ ]:
# "data"
ni, nj, nk = (10, 15, 20)
amplitude = np.random.rand(ni, nj, nk)
horizon = np.random.randint(5, 15, size=(ni, nj))
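One fancy-indexing sketch of an answer: broadcast row and column index arrays against a window of K offsets centred on each horizon value (the half-width of 3 is an arbitrary choice):

```python
import numpy as np

# the same toy "data" as above
ni, nj, nk = (10, 15, 20)
amplitude = np.random.rand(ni, nj, nk)
horizon = np.random.randint(5, 15, size=(ni, nj))

half = 3  # arbitrary window half-width
# index arrays shaped to broadcast together to (ni, nj, 2 * half + 1)
i = np.arange(ni)[:, np.newaxis, np.newaxis]
j = np.arange(nj)[np.newaxis, :, np.newaxis]
k = horizon[:, :, np.newaxis] + np.arange(-half, half + 1)
subvolume = amplitude[i, j, k]
print(subvolume.shape)
```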
An author of a foreign package (included with the exercises as
`problems/mutable_str.py`) provides a string class that
allocates its own memory:

```ipython
In [1]: from mutable_str import MutableString
In [2]: s = MutableString('abcde')
In [3]: print(s)
abcde
```
You'd like to view these mutable strings (mutable means you can modify them in place) as ndarrays, in order to manipulate the underlying memory.

Add an `__array_interface__` dictionary attribute to `s`, then convert `s` to an ndarray. Numerically add "2" to the array (use the in-place operator `+=`). Then print the original string to ensure that its value was modified.

Hint: documentation for NumPy's `__array_interface__` may be found in the online docs.
Here's a skeleton outline:
In [ ]:
import numpy as np
from mutable_str import MutableString

s = MutableString('abcde')

# --- EDIT THIS SECTION ---
# Create an array interface to this foreign object
s.__array_interface__ = {'data': (XXX, False),  # (ptr, is read-only?)
                         'shape': XXX,
                         'typestr': '|u1',  # typecode: unsigned byte
                         }
# --- EDIT THIS SECTION ---

print('String before converting to array:', s)
sa = np.asarray(s)
print('String after converting to array:', sa)
sa += 2
print('String after adding "2" to array:', s)